Heart Failure Prediction Dataset

For this project, either use your own Kaggle API key and run the code below, or download the dataset manually from the Kaggle website.

Tutorial to create Kaggle API key in Python: https://stackoverflow.com/questions/49310470/using-kaggle-datasets-in-google-colab

Link to manually download the dataset: https://www.kaggle.com/fedesoriano/heart-failure-prediction/download


Estimated runtime: about 2 hours

Attribute Information

Information available in Kaggle data main page

  1. Age: age of the patient [years]
  2. Sex: sex of the patient [M: Male, F: Female]
  3. ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
  4. RestingBP: resting blood pressure [mm Hg]
  5. Cholesterol: serum cholesterol [mg/dl]
  6. FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
  7. RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
  8. MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
  9. ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
  10. Oldpeak: oldpeak = ST [Numeric value measured in depression]
  11. ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
  12. HeartDisease: output class [1: heart disease, 0: Normal]

I. Feature exploration

There are no missing values per se in the dataset.

At first glance there are 7 numerical and 5 categorical columns. However, looking at the data types, some of the numerical values are actually encoded categorical variables. For instance, FastingBS is categorical: 1 indicates diabetes (fasting blood sugar > 120 mg/dl) and 0 normal. HeartDisease is the same. So the true split is 5 numerical and 7 categorical variables.

There are 11 predictors, and the response is binary, with 1 indicating heart disease and 0 normal.

Now, to study the associations between predictors, one must first understand the types of the variables involved.

Indeed, some variables here are categorical dichotomous (binary), some are categorical non-dichotomous and some are numerical.

Pearson correlation is only meaningful between two numerical variables. To study the association involving categorical variables, the Pearson correlation isn't relevant, and other coefficients must be used.
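For two categorical variables, one such alternative is Cramér's V, derived from the chi-squared statistic of the contingency table. The helper below is a sketch added for illustration; the toy `sex`/`angina` series are made up and are not the project's data:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V association between two categorical series (0 = none, 1 = perfect)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

# Invented toy data mimicking two binary columns of the dataset
sex = pd.Series(["M", "M", "F", "F", "M", "F", "M", "F"])
angina = pd.Series(["Y", "Y", "N", "N", "Y", "N", "N", "N"])
v = cramers_v(sex, angina)
```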


To solve this issue, the proposed solution is:

This pairwise comparison plot is made for every single x1 and x2 combination.

The strategy was the following:


For readability purposes, the plots comparing a categorical and a numerical variable are identical regardless of the order of x1 and x2. As shown above, subplots 2:1 and 1:2 are the same.
In addition to providing a clear overview of the different relationships, some observations are useful for data processing:

There is no strong correlation between any pair of predictors; the maximum correlation coefficient is 0.4, which can be considered medium strength.
It is important to notice that cholesterol has a negative correlation with heart disease risk, which is quite counter-intuitive.
A further analysis of each predictor against the response is presented below.

I.1. Age

The graphic on top shows the number of heart disease cases per age, compared to the distribution of age over the entire sample population.
The number of cases by age follows the age distribution of the entire sample population, so the data must be debiased from this original distribution.

To do so, the percentage of affected patients at each age is used as the response. To illustrate, if 4 people in the sample population are 74 years old and 3 of them have heart disease, the response for age 74 is 75%.
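This per-age percentage can be sketched with a pandas groupby; the miniature DataFrame below is hypothetical and only mirrors the 4-patients/3-cases illustration:

```python
import pandas as pd

# Hypothetical mini-sample mirroring the 4-patients / 3-cases illustration
df = pd.DataFrame({
    "Age": [74, 74, 74, 74, 50, 50],
    "HeartDisease": [1, 1, 1, 0, 1, 0],
})

# Share of patients with heart disease at each age, in percent
rate_by_age = df.groupby("Age")["HeartDisease"].mean() * 100
```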

Comparing the total sampling population to the affected population over age, it seems that there is in fact a direct correlation between Age and risk of Heart Disease (which is expected, according to current medical knowledge).

The older one gets, the more likely a heart disease becomes. The plot of heart disease occurrence over age is skewed by the sample population's age distribution, which is not uniform, as shown below: it seems to follow a Gaussian distribution with a mean around 50.

I.2. Sex

For the sex of the patient, 63% of men have heart disease versus only about 26% of women. The difference is quite significant: men are more likely to suffer a heart disease than women.

I.3. Chest Pain Type

These two plots show the count of each chest pain type for both populations, and the percentage of each pain type associated with heart disease. Two observations can be made from these plots.

I.4. Resting Blood Pressure

The resting blood pressure distribution seems Gaussian, centered on 130. There is one outlier to drop, with a resting BP of 0, which would essentially mean the patient is dead.

The frequency plot once again shows no particular trend: the number of heart disease cases simply increases with the number of people in each age category. The repartition of heart disease as a function of RestingBP is quite noisy. Moreover, one value (0) makes no sense, since blood pressure must be positive. Once this odd value is removed, the percentage seems to increase with RestingBP, and applying a filter to denoise the data clearly confirms that assumption.



The chosen filter is the Savitzky-Golay filter, because of its simplicity of use. For each point $i$, it fits an $n$-th order polynomial over a window of $k$ points; the new value is then $\hat{f}(i)$.

In the example below, it fits a linear model over a 21-point window.
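Such a filter is available as `scipy.signal.savgol_filter`; the snippet below is an illustrative sketch on synthetic data, using the 21-point linear fit described above:

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
noisy = x + rng.normal(scale=0.1, size=x.size)  # synthetic noisy increasing trend

# Linear fit (polyorder=1) over a sliding 21-point window, as described above
smoothed = savgol_filter(noisy, window_length=21, polyorder=1)
```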

I.5. Cholesterol

The cholesterol level is trickier. The value 0 represents a large proportion of the sample population, so it would be better to keep these rows in the dataset. The very high values correspond to severe cases of high cholesterol; these are plausible values.

First of all, it is important to look at the responses associated with these 0 values.

Most of the 0 values are associated with heart disease (over 88%). Knowing that, it is very likely that these 0s come from a source where cholesterol wasn't reported (the data is aggregated from several different datasets).

It is obvious that these rows cannot be used as they are, because the model would be biased by these missing values.

Because all of the other features in these rows are relevant, it would be better to find a way to keep them.

Some simple solutions are proposed as a first step:

Option 1 : Cholesterol as a binary variable

For the first option, the strategy is to set a threshold splitting cholesterol into two categories, high and low.

Both variables are binary, so the point-biserial correlation is a better indicator of a possible association between them.
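This check is available as `scipy.stats.pointbiserialr`; the binary arrays below are invented for illustration, not the project's data:

```python
import numpy as np
from scipy.stats import pointbiserialr

# Invented binary encodings: 1 = high cholesterol, and the HeartDisease response
high_chol = np.array([1, 1, 0, 0, 1, 0, 1, 0, 0, 1])
disease = np.array([1, 0, 0, 0, 1, 1, 1, 0, 0, 0])

r, p = pointbiserialr(high_chol, disease)
```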

When treated as categorical, the cholesterol level has a weak correlation with the risk of heart disease.

On this sample, it is important to notice that about 37% of people in the low-cholesterol category are healthy, versus about 50% of those in the high-cholesterol category.

These numbers go against current medical knowledge: high cholesterol is known to induce cardiac problems.

Option 2 : Cholesterol as a numerical variable

Another strategy is to treat cholesterol as a numerical value. To do so, the possible values for the missing cholesterol must be 'guessed'. Knowing that about 90% of the missing-cholesterol population had heart disease, one guess is that the average cholesterol value for this population was quite high.

It is then possible to generate replacement cholesterol values following a Gaussian distribution centered on a high cholesterol value.
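A minimal sketch of this replacement; the mean of 240 mg/dl and standard deviation of 30 are illustrative choices, not values taken from the text:

```python
import numpy as np

rng = np.random.default_rng(42)
chol = np.array([0, 210, 0, 180, 289, 0, 230], dtype=float)  # 0 marks a missing value

# Draw replacements from a Gaussian centred on a high cholesterol value
# (mean 240 mg/dl and sd 30 are illustrative choices, not values from the text)
missing = chol == 0
chol[missing] = rng.normal(loc=240.0, scale=30.0, size=missing.sum())
```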

There is no clear evidence of any impact of the cholesterol level on the risk of heart disease when a Gaussian distribution is used to replace the missing values.

Moreover, the point-biserial result is very low, meaning there is only a very weak correlation between this new data and the risk of heart disease.

These missing values still need to be replaced properly, so an entire part of this project is dedicated to the imputation of new cholesterol values.

I.6. Fasting Blood Sugar

Fasting blood sugar is an indicator of diabetes. According to both the data and medical knowledge, diabetes increases the odds of heart disease. Without diabetes, the odds of heart disease are almost 50/50; with diabetes, this value rises to 80%.

I.7. Resting ECG

For the resting ECG, the picture is mixed. One would have guessed that a normal ECG means a lower risk of heart disease, but this is only slightly the case. An abnormal ECG (e.g. ST) does increase the odds of heart disease.

I.8. Maximum Heart Rate

There are no clear outliers for the MaxHR variable. Once again, the distribution of values could be approximated by a Gaussian centered on 130-140.

The maximum heart rate is one of the features most correlated with the response according to the correlation matrix, with a correlation of -0.4. Even though the raw count is skewed by the non-uniform sample distribution, the percentage plot shows a clear decreasing trend between the share of the population having heart disease and the maximum HR.

I.9. Exercise Angina

There is an obvious association between heart disease and exercise angina (chest pain triggered by exercise): 35% of the population without exercise angina had heart disease, versus around 85% of those with it.

I.10. Old Peak

Oldpeak measures the deviation of an ECG segment from the baseline. It is measured as a depression, so negative values should be converted to positive ones.

Oldpeak is clearly correlated with heart disease, as expected: the higher it gets, the higher the odds of heart disease. This makes sense, as the value reflects how weak the heart is.

I.11. ST Slope

Before diving into the analysis of this feature, it is better to understand its meaning. On an ECG, after a heartbeat, the signal decreases, becoming negative after a peak. The ST slope is the trend of the section between the S wave and the start of the T wave (see the picture below). It usually has a slight upward trend; however, in some cases, mostly related to myocardial infarction, it can have a flat or downsloping shape.

[Image: ST segment on an ECG]

ST_Slope is one of the features most correlated with heart disease. Indeed, as the graphs show, heart disease is very likely when the slope is flat or downsloping: the risk increases by 400% if the slope is not upward.

Summary

Now that every feature has been analyzed a first time, some conclusions can be drawn and the data can be transformed into a more meaningful shape.

  1. Age has a positive correlation;
  2. Men are more likely to get heart disease than women;
  3. Most of the population with heart disease does not have any particular chest pain;
  4. Resting blood pressure has a positive correlation; one outlier needs to be removed;
  5. Cholesterol needs to be preprocessed, as there are 0 values;
  6. Having diabetes increases the odds of having heart disease;
  7. A normal resting ECG does not decrease the odds;
  8. Maximum HR has a negative correlation;
  9. Exercise angina is highly correlated;
  10. The Oldpeak feature has a positive correlation;
  11. An ST_Slope other than Up increases the odds of heart disease by 400%;

II. Preprocessing

Two actions need to be taken:

Removing the outlier is a fairly simple operation:
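For instance, with pandas (the three-row frame below is a stand-in for the real dataset, used only to show the filter):

```python
import pandas as pd

# Stand-in frame; the real dataset is loaded from the Kaggle CSV
df = pd.DataFrame({"RestingBP": [130, 0, 145], "HeartDisease": [0, 1, 1]})

# Drop the physically impossible RestingBP == 0 row
df = df[df["RestingBP"] > 0].reset_index(drop=True)
```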

II.1. Cholesterol missing values imputation

There are 171 missing values.

In order to keep the 171 rows with the missing cholesterol values, it is necessary to find a way to replace these 0s.
The strategy here is to find a model predicting the cholesterol value with a fairly good prediction accuracy.

The metric chosen to compare these imputing methods is the accuracy of the downstream classifiers.

First, the categorical variables must be transformed by one-hot encoding them.
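One-hot encoding can be sketched with `pd.get_dummies`; the few rows below are made up for illustration:

```python
import pandas as pd

# A few made-up rows with two of the dataset's categorical columns
df = pd.DataFrame({
    "Sex": ["M", "F", "M"],
    "ST_Slope": ["Up", "Flat", "Down"],
    "Age": [40, 55, 63],
})

# Each categorical column becomes one indicator column per category
encoded = pd.get_dummies(df, columns=["Sex", "ST_Slope"])
```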

The data is split this way:

II.1.a. Simple Imputing

The sklearn.impute module allows a quick look at several simple and more elaborate imputing techniques.
The simple imputers are:

With iterative imputers, a model is fitted to predict the missing values; the models used are:

The base code builds on this scikit-learn example for ranking different imputers:

https://scikit-learn.org/stable/auto_examples/impute/plot_missing_values.html#sphx-glr-auto-examples-impute-plot-missing-values-py
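A condensed sketch of the two approaches; the tiny matrix and the ExtraTrees estimator choice are illustrative, not the notebook's exact setup:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

# Tiny illustrative matrix; NaN plays the role of the missing cholesterol
X = np.array([[1.0, 210.0], [2.0, np.nan], [3.0, 230.0], [4.0, 250.0]])

# Simple imputer: fill with the column median
simple = SimpleImputer(strategy="median").fit_transform(X)

# Iterative imputer: fit a model to predict the missing entries
iterative = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),
    random_state=0,
).fit_transform(X)
```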

For simple and iterative imputers, the best score is most of the time obtained with 'Clean Data'. However, by fine-tuning some of the estimators and adding other models to the list, it might be possible to make these predictions more accurate.

II.1.b. Custom Imputer

The sklearn.impute class didn't yield satisfactory results, and the advanced algorithms used were not fine-tuned. Moreover, other algorithms can be tried.

The model accuracy evaluation is done as follows:

1. ExtraTreesRegressor
   a. RandomizedSearch
   b. GridSearchCV
   c. Accuracy score
2. KNeighborsRegressor
   a. RandomizedSearch
   b. GridSearchCV
   c. Accuracy score
3. XGBoost
   a. RandomizedSearch
   b. GridSearchCV
   c. Accuracy score
4. ElasticNet

ElasticNet being very quick to fit, there is no need to perform a RandomizedSearch on it.

5. Non-linear regression using a neural network

Because of the execution time, only one neural network is trained. Further investigation could lead to improved accuracy.

As the figures above show, there is no strong sign of overfitting (the training curve does not keep decreasing while the validation curve stays flat).

II.1.c. Imputer Choice

The choice of imputer depends on the accuracy of the imputer. Due to the stochastic nature of the process, the imputer choice might differ from one run to another.

The negative correlation between Cholesterol and HeartDisease became slightly positive.

The evaluation used an untuned RandomForestRegressor to find the 'best imputer'. As a reminder, the accuracy varies between runs because of the stochastic nature of the cross-validation:

II.2. Feature Transformation

II.2.a. Decomposition with Principal Component Analysis


The dataset used to perform PCA contains 20 features (11 initially, 20 once encoded). To reach 100% explained variance, the first 15 components are needed with linear PCA.
With only 9 components, 80% of the variance is explained.
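The component count needed for a given explained-variance target can be read off the cumulative ratio; the snippet below sketches this on synthetic data (the real notebook uses the encoded heart dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 20-feature data standing in for the encoded dataset
X, _ = make_classification(n_samples=300, n_features=20, n_informative=10,
                           random_state=0)

pca = PCA().fit(StandardScaler().fit_transform(X))
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_80 = int(np.argmax(cumvar >= 0.80)) + 1  # smallest component count reaching 80%
```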

Linear PCA's linearity assumption makes the decomposition less effective when there is no linear relationship between features. To explore non-linearity, PCA with the kernel trick is used.

On the graph above, it seems that only the cosine kernel manages to explain 100% of the variance in fewer than 20 components.

Regarding 80% explained variance, the cosine kernel is very similar to linear PCA, reaching it with 9 components, whereas the RBF and polynomial kernels reach it with 12 components.

It is possible to find the best KernelPCA kernel and hyperparameters with a RandomizedSearch.



Optimization for Kernel PCA:

This is a representation of the decomposition in 2D for cosine KernelPCA and linear PCA.
The explained variance with 4 components is 49.83% for linear PCA and 47.00% for KernelPCA with the cosine kernel.

II.2.b. Decomposition using MCA and FAMD

Another option for dimensionality reduction would be to use Multiple Correspondence Analysis, which is PCA applied to categorical variables.

Another technique called Factor Analysis of Mixed Data (FAMD) combines Principal Component Analysis (PCA) to deal with numerical variables and Multiple Correspondence Analysis (MCA) for categorical ones.

MCA requires a fully categorical dataset. The strategy to convert the numerical data to categorical is:


The cut values are chosen arbitrarily and could be optimized to maximize the explained variance.
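This numerical-to-categorical conversion can be sketched with `pd.cut`; the cut points below are arbitrary examples, as noted above:

```python
import pandas as pd

age = pd.Series([34, 45, 52, 61, 70])
# Arbitrary illustrative cut points and labels (not the notebook's exact thresholds)
age_cat = pd.cut(age, bins=[0, 40, 55, 120], labels=["young", "middle", "senior"])
```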

Transforming the dataset into a fully categorical one does not yield a better explained variance: MCA with 4 components reaches 40.83%. This can be explained by the loss of information in the numerical-to-categorical conversion and the 'arbitrary' thresholds used to split the categories.

As we can see, the combined PCA+MCA decomposition results in an explained variance similar to that of linear PCA.

III. Models Exploration

Now that the imputing method has been selected, the data can be used to build models. A non-exhaustive list of models has been drawn up:


Ensemble Methods:

For each classifier, the impact of scaling is compared, as well as the accuracy obtained with different feature sets (PCA, FAMD, KernelPCA, and the original data). Two metrics are used to compare the models' performance: accuracy and ROC-AUC score.
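The comparison loop can be sketched as follows; synthetic data and two example classifiers stand in for the full model list:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for name, clf in {"logreg": LogisticRegression(max_iter=1000),
                  "knn": KNeighborsClassifier()}.items():
    # Scaling is part of the pipeline so its impact can be compared fairly
    model = make_pipeline(StandardScaler(), clf).fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    results[name] = (accuracy_score(y_te, model.predict(X_te)),
                     roc_auc_score(y_te, proba))
```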

III.1. Logistic Regression Classification

III.2 Support Vector Machine

III.3 K-Nearest Neighbors

III.4. Naive Bayes

III.5 Random Forest Classification

Hyperparameter tuning can be very time-consuming with RandomForestClassifier, due to the number of hyperparameters tuned. Moreover, the number of decision trees has a direct impact on the runtime. Because of time constraints, the classifier tuned with RandomizedSearch will be used.

III.6. AdaBoost Classifier

AdaBoost only supports base estimators whose fit method accepts the sample_weight parameter:

III.7. Voting Classifier

Similar to AdaBoost, the (soft) voting classifier can only use estimators that implement the predict_proba method.
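A minimal sketch of a soft-voting ensemble on synthetic data, where every member implements predict_proba (the estimator choices are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, random_state=1)

# Soft voting averages predict_proba, so every member must implement it
vote = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=50, random_state=1)),
                ("nb", GaussianNB())],
    voting="soft",
).fit(X, y)
```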

Ensemble methods are very similar to the 'simple' methods in terms of accuracy on the original dataset. However, ensemble methods with FAMD features usually lead to better accuracy (above 0.9 expected).

Improvement possibilities

This project could yield better results with the following additions: